Genome-wide association study (GWAS)

This notebook demonstrates conducting a genome-wide association study using the public 1000 Genomes dataset stored in BigQuery. Specifically we will:

  • Use the %%sql magic to write and execute SQL statements within the notebook
  • Extract data from BigQuery and create a local dataset that can be manipulated in Python for visualization and further analysis

NOTE:

  • If you're new to notebooks, or want to explore additional samples, check out the full list of general notebooks.
  • For additional Genomics samples, check out the full list of Genomics notebooks.

Experiment design

In this experiment, we'll be identifying variant positions within chromosome 12 that differ significantly between the case and control groups. The case group for the purposes of this notebook comprises individuals from the "EAS" (East Asian) super population; all other individuals form the control group. Variant data from the 1000 Genomes dataset is publicly accessible within BigQuery.


In [1]:
import gcp.bigquery as bq
variants_table = bq.Table('genomics-public-data:1000_genomes.variants')
variants_table.schema


Out[1]:

Classifying per-call variant positions into variant/non-variant groups

We can tally the reference/alternate allele counts for individual variant positions within chromosome 12. The field call.genotype is an integer in the range [-1, num_alternate_bases], interpreted as follows (a short Python sketch of this encoding appears after the list):

  • A value of negative one indicates that the genotype for the call is ambiguous (i.e., a no-call).
  • A value of zero indicates that the genotype for the call is the same as the reference (i.e., non-variant).
  • A value of one would indicate that the genotype for the call is the 1st value in the list of alternate bases (likewise for values >1).
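
To make this encoding concrete, here is a minimal Python sketch (the helper classify_genotype is purely illustrative, not part of any library):

def classify_genotype(genotype):
    # Interpret a single call.genotype value per the rules above.
    if genotype == -1:
        return 'no-call'
    elif genotype == 0:
        return 'reference'
    else:
        # 1-based index into the list of alternate bases.
        return 'alternate allele #%d' % genotype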

In [2]:
%%sql --module allele_counts

SELECT 
    reference_name,
    start,
    reference_bases,
    alternate_bases,
    end,
    vt,    
    call.call_set_name AS call_set_name,
    # 1000 genomes phase 1 data is bi-allelic so there is only ever a single alt
    SUM(0 = call.genotype) WITHIN RECORD AS ref_count,
    SUM(1 = call.genotype) WITHIN RECORD AS alt_count,
FROM
  FLATTEN((
    SELECT
      reference_name,
      start,
      reference_bases,
      alternate_bases,
      end,
      vt,
      call.call_set_name,
      call.genotype,
    FROM
      $variants_table
    WHERE
      reference_name = '12' -- i.e., chromosome 12
    ),
    call)

Let's verify that our allele counts match our expectations before moving on. Since these genomes are diploid, the reference and alternate counts for any given row should sum to exactly 2; a quick programmatic check follows the sampled rows below.


In [3]:
bq.Query(allele_counts, variants_table=variants_table).sample().to_dataframe()


Out[3]:
reference_name start reference_bases alternate_bases end vt call_set_name ref_count alt_count
0 12 19434623 A G 19434624 SNP HG00261 2 0
1 12 19434623 A G 19434624 SNP HG00593 2 0
2 12 19434623 A G 19434624 SNP NA12749 2 0
3 12 19434623 A G 19434624 SNP HG00150 2 0
4 12 19434623 A G 19434624 SNP NA19675 2 0
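
As a quick programmatic check (a sketch that re-runs the sampling query above), we can assert that every row carries exactly two alleles:

df = bq.Query(allele_counts, variants_table=variants_table).sample().to_dataframe()
# Each genome is diploid, so every call contributes exactly two alleles.
assert ((df['ref_count'] + df['alt_count']) == 2).all()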

Assigning case and control groups

Now we can join our allele counts with metadata available in the sample info table. We'll use this sample metadata to split the set of genomes into case and control groups based upon the super population group.


In [4]:
sample_info_table = bq.Table('genomics-public-data:1000_genomes.sample_info')

In [5]:
%%sql --module exp_groups

SELECT
  super_population,
  ('EAS' = super_population) AS is_case,
  call_set_name,
  reference_name,
  start,
  reference_bases,
  alternate_bases,
  end,
  vt,
  ref_count,
  alt_count,
FROM $allele_counts AS allele_counts
JOIN $sample_info_table AS samples
  ON allele_counts.call_set_name = samples.sample

In [6]:
bq.Query(exp_groups, allele_counts=allele_counts,
                     sample_info_table=sample_info_table,
                     variants_table=variants_table).sample().to_dataframe()


Out[6]:
super_population is_case call_set_name reference_name start reference_bases alternate_bases end vt ref_count alt_count
0 EUR False HG00261 12 18330083 C A 18330084 SNP 0 2
1 EAS True HG00593 12 18330083 C A 18330084 SNP 1 1
2 EUR False NA12749 12 18330083 C A 18330084 SNP 0 2
3 EUR False HG00150 12 18330083 C A 18330084 SNP 1 1
4 AMR False NA19675 12 18330083 C A 18330084 SNP 0 2

The variants table contains a few different types of variant: structural variants ("SV"), indels ("INDEL") and SNPs ("SNP").


In [7]:
%%sql
SELECT 
  vt,
  COUNT(*)
FROM $exp_groups
GROUP BY vt


Out[7]:

(rows: 3, time: 0.4s, cached, job: job_l3Z5dkMGEBkpl61jsNfJu6-GXmg)

For the purposes of this experiment, let's limit the variants to only SNPs.


In [8]:
%%sql --module snps
SELECT * 
FROM $exp_groups
WHERE vt='SNP'

In [9]:
bq.Query(snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).sample().to_dataframe()


Out[9]:
super_population is_case call_set_name reference_name start reference_bases alternate_bases end vt ref_count alt_count
0 EUR False HG00261 12 18372942 A G 18372943 SNP 2 0
1 EAS True HG00593 12 18372942 A G 18372943 SNP 2 0
2 EUR False NA12749 12 18372942 A G 18372943 SNP 2 0
3 EUR False HG00150 12 18372942 A G 18372943 SNP 2 0
4 AMR False NA19675 12 18372942 A G 18372943 SNP 2 0

Tallying reference/alternate allele counts for case/control groups

Now that we've assigned each call set to either the case or the control group, we can tally up the counts of reference and alternate alleles within each of our assigned case/control groups, for each variant position, like so:


In [10]:
%%sql --module grouped_counts

SELECT
    reference_name,
    start,
    end,
    reference_bases,
    alternate_bases,
    vt,
    SUM(ref_count + alt_count) AS allele_count,
    SUM(ref_count) AS ref_count,
    SUM(alt_count) AS alt_count,
    SUM(IF(TRUE = is_case, INTEGER(ref_count + alt_count), 0)) AS case_count,
    SUM(IF(FALSE = is_case, INTEGER(ref_count + alt_count), 0)) AS control_count,
    SUM(IF(TRUE = is_case, ref_count, 0)) AS case_ref_count,
    SUM(IF(TRUE = is_case, alt_count, 0)) AS case_alt_count,
    SUM(IF(FALSE = is_case, ref_count, 0)) AS control_ref_count,
    SUM(IF(FALSE = is_case, alt_count, 0)) AS control_alt_count,
FROM $snps
GROUP BY
    reference_name,
    start,
    end,
    reference_bases,
    alternate_bases,
    vt

Again, let's validate that the group-level counts make sense (still per variant position); a programmatic check follows the sampled rows below.


In [11]:
bq.Query(grouped_counts,
         snps=snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).sample().to_dataframe()


Out[11]:
reference_name start end reference_bases alternate_bases vt allele_count ref_count alt_count case_count control_count case_ref_count case_alt_count control_ref_count control_alt_count
0 12 113369176 113369177 T G SNP 2184 499 1685 572 1612 124 448 375 1237
1 12 84117090 84117091 T A SNP 2184 2182 2 572 1612 572 0 1610 2
2 12 41729418 41729419 G A SNP 2184 1101 1083 572 1612 283 289 818 794
3 12 41768110 41768111 G A SNP 2184 1121 1063 572 1612 284 288 837 775
4 12 26584345 26584346 G A SNP 2184 103 2081 572 1612 0 572 103 1509
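
As a sketch of that validation (re-running the sampling query above), the case and control totals should partition the overall allele count, as should the reference and alternate totals:

grouped_df = bq.Query(grouped_counts,
                      snps=snps,
                      exp_groups=exp_groups,
                      allele_counts=allele_counts,
                      sample_info_table=sample_info_table,
                      variants_table=variants_table).sample().to_dataframe()
# Case plus control alleles should account for every allele...
assert (grouped_df['case_count'] + grouped_df['control_count'] ==
        grouped_df['allele_count']).all()
# ...and so should reference plus alternate alleles.
assert (grouped_df['ref_count'] + grouped_df['alt_count'] ==
        grouped_df['allele_count']).all()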

Quantifying the statistical significance of each variant position

We can quantify the statistical significance of each variant position using the chi-squared test with Yates' continuity correction. Furthermore, we can restrict our result set to only statistically significant variant positions by ranking each position by its chi-squared score (decreasing) and thresholding the results at p <= 5e-8 (chi-squared score >= 29.71679 for one degree of freedom).
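
For reference, here is a minimal Python sketch of the same Yates-corrected statistic that the query below computes for each 2x2 (case/control by reference/alternate) contingency table; the helper yates_chi_squared is purely illustrative:

def yates_chi_squared(case_ref, case_alt, control_ref, control_alt):
    # Chi-squared with Yates' continuity correction:
    # sum over cells of (|observed - expected| - 0.5)^2 / expected.
    case = case_ref + case_alt
    control = control_ref + control_alt
    total = case + control
    ref = case_ref + control_ref
    alt = case_alt + control_alt
    cells = [(case_ref, float(ref) * case / total),
             (control_ref, float(ref) * control / total),
             (case_alt, float(alt) * case / total),
             (control_alt, float(alt) * control / total)]
    return sum((abs(observed - expected) - 0.5) ** 2 / expected
               for observed, expected in cells)

For example, yates_chi_squared(220, 352, 1593, 19) returns approximately 1086.505, matching the first row of the results further below.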


In [12]:
%%sql --module results

SELECT
  reference_name,
  start,
  end,
  reference_bases,
  alternate_bases,
  vt,
  case_count,
  control_count,
  allele_count,
  ref_count,
  alt_count,
  case_ref_count,
  case_alt_count,
  control_ref_count,
  control_alt_count,
  # http://homes.cs.washington.edu/~suinlee/genome560/lecture7.pdf
  # https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity
  ROUND(
    POW(ABS(case_ref_count - (ref_count/allele_count)*case_count) - 0.5,
      2)/((ref_count/allele_count)*case_count) +
    POW(ABS(control_ref_count - (ref_count/allele_count)*control_count) - 0.5,
      2)/((ref_count/allele_count)*control_count) +
    POW(ABS(case_alt_count - (alt_count/allele_count)*case_count) - 0.5,
      2)/((alt_count/allele_count)*case_count) +
    POW(ABS(control_alt_count - (alt_count/allele_count)*control_count) - 0.5,
      2)/((alt_count/allele_count)*control_count),
    3) AS chi_squared_score
FROM $grouped_counts
WHERE
  # For chi-squared, expected counts must be at least 5 for each group
  (ref_count/allele_count)*case_count >= 5.0
  AND (ref_count/allele_count)*control_count >= 5.0
  AND (alt_count/allele_count)*case_count >= 5.0
  AND (alt_count/allele_count)*control_count >= 5.0
HAVING
  # Chi-squared critical value for df=1, p-value=5*10^-8 is 29.71679
  chi_squared_score >= 29.71679
ORDER BY
  chi_squared_score DESC,
  allele_count DESC
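
The critical value 29.71679 used in the HAVING clause above can be derived with scipy (a quick sketch, assuming scipy is available in the notebook environment):

from scipy.stats import chi2
# Inverse survival function: chi-squared critical value for p = 5e-8 at df = 1.
print 'critical value = %.5f' % chi2.isf(5e-8, df=1)  # ~29.71679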

We now run this query over all the variants within chromosome 12 and return the first few most significant variants.


In [13]:
bq.Query(results,
         grouped_counts=grouped_counts,
         snps=snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).sample().to_dataframe()


Out[13]:
reference_name start end reference_bases alternate_bases vt case_count control_count allele_count ref_count alt_count case_ref_count case_alt_count control_ref_count control_alt_count chi_squared_score
0 12 110571373 110571374 A G SNP 572 1612 2184 1813 371 220 352 1593 19 1086.505
1 12 22509094 22509095 T C SNP 572 1612 2184 1589 595 116 456 1473 139 1073.082
2 12 22525352 22525353 A C SNP 572 1612 2184 1585 599 115 457 1470 142 1068.266
3 12 22524240 22524241 T C SNP 572 1612 2184 1584 600 115 457 1469 143 1065.294
4 12 33240567 33240568 C G SNP 572 1612 2184 1678 506 159 413 1519 93 1043.029

Scroll to the right in the above results to see that the positions deemed significant do in fact have significantly different case/control counts for the alternate/reference bases.

Computing Chi-squared statistics in BigQuery vs Python vs R

Let's compare these BigQuery-computed chi-squared scores to ones calculated via Python's statistical packages. Note that scipy's chi2_contingency applies Yates' continuity correction by default for 2x2 tables, matching the computation above.


In [14]:
import numpy as np
from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(np.array([ 
    [220, 352], # case 
    [1593, 19]  # control
]))

print 'Python Chi-sq score = %.3f' % chi2


Python Chi-sq score = 1086.505

For comparison, computing the same statistic separately in R gives the result displayed here:

> chisq.test(rbind(c(220, 352), c(1593, 19)))

    Pearson's Chi-squared test with Yates' continuity correction

data:  rbind(c(220, 352), c(1593, 19))
X-squared = 1086.505, df = 1, p-value < 2.2e-16

For both the Python and the R computations, the value matches the 1086.505 score from BigQuery.

Analyzing the GWAS results

First, how many statistically significant variant positions did we find?


In [15]:
%%sql 
SELECT COUNT(*) AS num_significant_snps
FROM $results


Out[15]:

(rows: 1, time: 0.5s, cached, job: job_LrwWjBy-rCHAGBsE2LpCikoKclA)

We now have a dataset that is sufficiently small to fit into memory on our instance, so let's pull the top 1000 SNP positions locally. Since we only need a subset of the columns, we can project our data first to remove unneeded columns.


In [16]:
%%sql --module sig_snps_dataset
SELECT * FROM (
  SELECT
    reference_name,
    start,
    reference_bases,
    alternate_bases,
    chi_squared_score
  FROM $results
  LIMIT 1000
)
ORDER BY start asc

In [17]:
sig_snps = bq.Query(sig_snps_dataset, 
                    results=results,
                    grouped_counts=grouped_counts,
                    snps=snps,
                    exp_groups=exp_groups,
                    allele_counts=allele_counts,
                    sample_info_table=sample_info_table,
                    variants_table=variants_table).to_dataframe()
sig_snps[:10]


Out[17]:
reference_name start reference_bases alternate_bases chi_squared_score
0 12 1595431 A G 744.125
1 12 1600977 A T 828.910
2 12 2129565 C T 711.515
3 12 2132494 T C 753.874
4 12 2134197 A G 822.961
5 12 2138687 C A 706.309
6 12 2139264 A C 769.534
7 12 2140624 T C 704.880
8 12 2141908 C T 790.150
9 12 6505528 T G 768.788

Let's visualize the distribution of significant SNPs along the length of the chromosome. The y-axis of the charts shows the chi-squared score: larger values indicate greater significance.


In [18]:
import matplotlib.pyplot as plt
import seaborn as sns  # importing seaborn restyles matplotlib charts

fig, ax = plt.subplots()
ax.scatter(sig_snps['start'], sig_snps['chi_squared_score'], alpha=0.3, c='red')
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')


Out[18]:
<matplotlib.text.Text at 0x7f956b1b8f50>

Let's zoom in on one region that contains a large number of very significant SNPs:


In [19]:
fig, ax = plt.subplots()
ax.scatter(sig_snps['start'], sig_snps['chi_squared_score'], alpha=0.5, c='red')
ax.set_xlim([3.3e7, 3.5e7])
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')


Out[19]:
<matplotlib.text.Text at 0x7f9540de9c50>

Further exploration of statistically significant variant positions

We can take our analysis further by mapping selected variant positions back to the chromosome and visualizing call sets and reads. Let's retrieve the top SNP identified when ranked by the Chi-squared score:


In [20]:
%%sql --module top_snp
SELECT start
FROM $sig_snps_dataset
ORDER BY chi_squared_score desc
LIMIT 1

In [21]:
bq.Query(top_snp,
         sig_snps_dataset=sig_snps_dataset,
         results=results,
         grouped_counts=grouped_counts,
         snps=snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).results()


Out[21]:

(rows: 1, time: 0.5s, cached, job: job_2uTBR9MmYxBWlv4lt3cSdw_dQmM)

Grab an arbitrary set of 10 callset IDs for rendering in the genome browser.


In [22]:
%%sql --module callset_ids
SELECT * FROM (
  SELECT call.call_set_id AS callset_id
  FROM $variants_table
  GROUP BY callset_id)
LIMIT 10

In [23]:
callsets_df = bq.Query(callset_ids, variants_table=variants_table).to_dataframe()
callsets = list(callsets_df['callset_id'])

In [24]:
from IPython.display import HTML
def gabrowse(dataset, reference_name, start_position, callset_ids):
    # Build one &callsetId= query parameter per call set to display.
    callsets_query_params = ''.join('&callsetId=%s&cBackend=GOOGLE' % callset_id
                                    for callset_id in callset_ids)
    # %3A is the URL-encoded ':' in the "reference:position" location string.
    url = ('https://gabrowse.appspot.com/#=&backend=GOOGLE&location='
           + reference_name + '%3A' + str(start_position)
           + callsets_query_params)
    return HTML('<iframe src="%s" width=1024 height=800></iframe>' % url)

Now we can render the call sets and reads for the selected SNP position by embedding the GABrowse application directly in our notebook.


In [25]:
gabrowse('1000genomes', '12', 110571373, callsets)


Out[25]:

Summary

This notebook illustrated how to conduct a GWAS experiment using variant data stored in the Google Genomics BigQuery tables, retrieve a local copy of the top results, and visualize the data with Python libraries.